Individual Poster Page

See copyright notice at the bottom of this page.

How are Runs Really Created - Third Installment

September 18, 2002 - Brian Blake

My compliments to Tango on an excellent and very educational series of articles.

I realize this was covered to some degree particularly in the first article, but there does seem to be some confusion among many readers as to why Base Runs is a superior model in theory.

Perhaps it is worth highlighting again here at the end.

All run creation models generally begin with known quantities (the number of hits, singles, doubles, triples, HRs, walks, etc.) and attempt to apply some formula to these numbers to estimate how many runs will be scored. The formula that is applied, however, always contains some degree of inaccuracy.

In any run estimator, the coefficients which are applied to these numbers are estimates. And no matter how much work is done to refine them, they will always be estimates which introduce some degree of inaccuracy and which cause unacceptable results under certain conditions.

One important thing Base Runs has done is to remove 2 KNOWN quantities from these calculations, thus reducing the level of error.

One thing that is KNOWN is that every HR scores at least one run, that of the batter who hit it. Therefore, there is no reason to multiply this run by any coefficient. Doing so only increases the inaccuracy of the model. In Base Runs, therefore, this part of the HR is removed from the estimated part of the calculation and instead, its TRUE value (1 for each HR) is added in the end.

The other thing that is known is the total number of baserunners. The first part of the calculation (H+BB-HR) calculates this KNOWN quantity. It is important to note that these 2 parts of the calculation are not only quite simple and elegant, but also 100% accurate. No estimate has yet been applied.

It is in the middle part of the equation, the calculation of the score rate, that things get more complex, and some degree of estimation and inaccuracy is introduced. But here Tango has made clear that this part of the calculation is preliminary, and that he believes it may be improved on in the future.

And I would add that not only is it possible that it will be improved in its theoretical accuracy, it is also possible that it might be improved in its simplicity, so that a version of Base Runs might well be created that would be best even for the "back of the envelope" calculations which many prefer.

Once the score rate of baserunners is known, the formula (H+BB-HR)*rate+HR would be quite simple to calculate.
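As a minimal sketch of that last step in Python: the function name, the stat line, and the score rate of 0.3 are all invented for illustration and do not come from Tango's articles.

```python
def base_runs_from_rate(hits, walks, home_runs, score_rate):
    """Runs = (baserunners other than HR hitters) * score rate + home runs."""
    # Known exactly: everyone who reached base, minus the batters who homered.
    baserunners = hits + walks - home_runs
    # Each HR contributes exactly one run (the batter), so it is added as-is.
    return baserunners * score_rate + home_runs


# Invented season line and invented score rate, for illustration only.
estimate = base_runs_from_rate(hits=1400, walks=500, home_runs=150, score_rate=0.3)
print(round(estimate, 1))
```

Only `score_rate` is an estimate here; the other two pieces are simple counts.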

Finally, in the calculation of this ratio as presented in the article, B/(B+outs), where B = (.8*1B + 2.1*2B + 3.4*3B + 1.8*HR + .1*BB), the calculation of B is no more complicated than that used by other run estimators which apply coefficients to each of these events. Because these coefficients are determined from real-world data, in a somewhat similar manner to those of other estimators, they are also estimates and are likely to introduce some degree of error.

But unlike some other estimators, these coefficients do not represent the VALUE of these events. Instead, they represent only the impact on the calculation of the overall rate of successfully scoring baserunners.
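Putting the three parts together, here is a rough Python sketch of the whole estimate using the coefficients quoted above. The function names and the season line are my own invention for illustration; only the coefficients and the overall shape of the formula come from the article.

```python
def b_factor(singles, doubles, triples, home_runs, walks):
    """B, using the coefficients quoted in this post."""
    return (0.8 * singles + 2.1 * doubles + 3.4 * triples
            + 1.8 * home_runs + 0.1 * walks)


def score_rate(b, outs):
    """Estimated rate at which baserunners score: B / (B + outs)."""
    return b / (b + outs)


def base_runs(singles, doubles, triples, home_runs, walks, outs):
    """(H + BB - HR) * score rate + HR."""
    hits = singles + doubles + triples + home_runs
    baserunners = hits + walks - home_runs
    b = b_factor(singles, doubles, triples, home_runs, walks)
    return baserunners * score_rate(b, outs) + home_runs


# Invented season line, for illustration only.
b = b_factor(singles=1000, doubles=280, triples=30, home_runs=160, walks=550)
rate = score_rate(b, outs=4100)
runs = base_runs(singles=1000, doubles=280, triples=30,
                 home_runs=160, walks=550, outs=4100)
print(round(b, 1), round(rate, 3), round(runs, 1))
```

Note that the parentheses in B / (B + outs) matter: the rate has to come out between 0 and 1.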

Also, while it might seem counterintuitive at first that the impact of the HR (with a coefficient of 1.8) is less than that of the triple (with a coefficient of 3.4), it is important to remember that we already eliminated the run-scoring value of the HR itself from this part of the calculation. Therefore, the HR coefficient includes only its impact in terms of driving in other baserunners, while the other coefficients include both the driving-in impact and the scoring impact (the chance that the new baserunner will score).

Thus, in theory, the triple likely has roughly the same driving-in impact on this ratio as the HR (because a triple would also clear the bases), which accounts for 1.8 of its coefficient. The additional 1.6 (for a coefficient of 3.4) would be due to the fact that adding a runner on third base increases the overall score rate, since a runner is more likely to score from 3rd base than from 2nd or 1st.

On the other extreme, the values for walks and singles are also entirely due to their driving-in or moving-over value. This is because a runner on first is NOT more likely to score than the average runner. This is why the coefficient for walks is so small. This does not suggest that additional walks do not increase scoring. They do, but this is already accounted for in the first part of the formula, which increases for each additional baserunner.

Perhaps one way this formula could be refined, for those who are interested, might be to attempt to separate the run-scoring and the moving-over (driving-in) components, though I expect this would only increase the complexity.

Another idea, which might make the score rate calculation a bit more intuitively satisfying, though more mathematically complex, might be to present it as the weighted average of the separate score rates for the single, double, triple, and HR. Thus, rather than saying a triple "increases the score rate of the average baserunner," we would be saying "take the % of singles multiplied by the score rate for singles, plus the % of doubles multiplied by the score rate for doubles, etc."

But this would also dramatically increase the complexity of the formula, as the score rates for each event are dependent on the others. If there are more triples, for example, more runners who single or walk will be driven in, so you would have a similarly lengthy calculation for each event.

One alternate way of presenting the formula for B which does get across to some degree what is going on here, is to say:

B = 8.2 * [.1(1B) + .26(2B) + .41(3B) + .22(HR) + .01(BB)]

All I've done here is to take the sum of the coefficients, 8.2, and back that out by dividing each coefficient by that sum, in order to express each coefficient as a percentage. The coefficients now sum to 1. Multiplying this out will produce the original formula. The only slight differences in results from using this formula would be caused by rounding error.
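For anyone who wants to check the arithmetic, a short Python sketch of the same normalization follows. The variable names are mine; the coefficients are the ones from the article.

```python
# Coefficients for B as quoted from the article.
coefficients = {"1B": 0.8, "2B": 2.1, "3B": 3.4, "HR": 1.8, "BB": 0.1}

# The constant that gets backed out.
total = sum(coefficients.values())

# Each coefficient as a share of the total, rounded to two places.
shares = {event: round(weight / total, 2)
          for event, weight in coefficients.items()}

# Multiplying back out shows how far rounding moves each coefficient.
rebuilt = {event: total * share for event, share in shares.items()}

for event in coefficients:
    print(event, coefficients[event], shares[event], round(rebuilt[event], 3))
```

The rounding shows up mostly in the smallest term: multiplied back out, the walk coefficient comes to roughly .08 rather than .1.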

By treating the 8.2 as a constant, we can see clearly the relative effect of each event, in percentage form, on the score rate. I think it is clear from Tango's data that this model is more accurate across different run environments than those models which treat the marginal value of each new event as static, and it should be, as each event in this formula increases the rate of scoring.
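To illustrate what "not static" means here, the following Python sketch adds one extra event to two stat lines and compares the change in the estimate. Both lines (`low` and `high`) are invented for illustration, as are the function names; only the coefficients and the formula come from the article.

```python
# Coefficients for B as quoted from the article.
WEIGHTS = {"1B": 0.8, "2B": 2.1, "3B": 3.4, "HR": 1.8, "BB": 0.1}


def base_runs(line):
    """(H + BB - HR) * B / (B + outs) + HR for a dict of event counts."""
    hits = line["1B"] + line["2B"] + line["3B"] + line["HR"]
    baserunners = hits + line["BB"] - line["HR"]
    b = sum(WEIGHTS[event] * line[event] for event in WEIGHTS)
    return baserunners * b / (b + line["outs"]) + line["HR"]


def marginal_value(line, event):
    """Change in estimated runs from adding one more of the given event."""
    bumped = dict(line)
    bumped[event] += 1
    return base_runs(bumped) - base_runs(line)


# Two invented season lines: a low-offense and a high-offense environment.
low = {"1B": 900, "2B": 220, "3B": 20, "HR": 90, "BB": 400, "outs": 4100}
high = {"1B": 1100, "2B": 320, "3B": 35, "HR": 220, "BB": 650, "outs": 4100}

for event in WEIGHTS:
    print(event,
          round(marginal_value(low, event), 3),
          round(marginal_value(high, event), 3))
```

With these made-up lines, every event is worth more in the higher-scoring environment, whereas a fixed-coefficient estimator would give the same value in both.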

But while Tango has shown his data grouped by HRs, OBP, and OPS, and demonstrated a reasonable degree of accuracy there, I would like to also see the data shown grouped by number of triples in a game, number of walks in a game, and number of doubles in a game.

The reason I would like to see this is that I think it might better demonstrate how accurate or inaccurate those coefficients are, and thus might help point the way towards further refinement of this part of the formula.
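A rough Python sketch of the kind of grouping I have in mind is below. The four games are invented placeholders, not real data, and the function names are mine; the point is only the shape of the check, which is to bucket games by the count of one event and compare the estimate to actual runs within each bucket.

```python
from collections import defaultdict

# Coefficients for B as quoted from the article.
WEIGHTS = {"1B": 0.8, "2B": 2.1, "3B": 3.4, "HR": 1.8, "BB": 0.1}


def base_runs(line):
    """(H + BB - HR) * B / (B + outs) + HR for a dict of event counts."""
    hits = line["1B"] + line["2B"] + line["3B"] + line["HR"]
    baserunners = hits + line["BB"] - line["HR"]
    b = sum(WEIGHTS[event] * line[event] for event in WEIGHTS)
    return baserunners * b / (b + line["outs"]) + line["HR"]


def mean_error_by_count(games, event):
    """Average (estimated - actual) runs, grouped by count of one event."""
    buckets = defaultdict(list)
    for game in games:
        buckets[game[event]].append(base_runs(game) - game["runs"])
    return {count: sum(errors) / len(errors)
            for count, errors in sorted(buckets.items())}


# Invented placeholder games, NOT real data.
games = [
    {"1B": 6, "2B": 2, "3B": 0, "HR": 1, "BB": 3, "outs": 27, "runs": 4},
    {"1B": 5, "2B": 1, "3B": 1, "HR": 0, "BB": 2, "outs": 27, "runs": 3},
    {"1B": 8, "2B": 3, "3B": 1, "HR": 2, "BB": 4, "outs": 24, "runs": 9},
    {"1B": 4, "2B": 0, "3B": 0, "HR": 0, "BB": 1, "outs": 27, "runs": 0},
]

errors_by_triples = mean_error_by_count(games, "3B")
print(errors_by_triples)
```

Run over a real game log, a mean error that drifts steadily as the triple (or walk, or double) count rises would suggest that the corresponding coefficient is off.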

It seems that if there is any inaccuracy here, it is likely in either the constant, 8.2, or in the relative weights assigned to the impact of the individual events.

We would expect groupings of HRs to work well, as a major part of the HR has been backed out of the part of the calculation that is prone to error. Likewise, if the balance between individual events, like walks, singles, doubles, and triples, is off somewhat, this will not likely show in groupings by OBP and OPS, because games with high relative rates of these events will likely be distributed fairly evenly across the groupings.

Tango, you've shown us the data for where this works well, now show us the data for where it might not (yet)!


Copyright notice

Comments on this page were made by person(s) with the same handle, in various comments areas, following Tangotiger © material, on Baseball Primer. All content on this page remains the sole copyright of the author of those comments.

If you are the author, and you wish to have these comments removed from this site, please send me an email (tangotiger@yahoo.com), along with (1) the URL of this page, and (2) a statement that you are in fact the author of all comments on this page, and I will promptly remove them.